Exploratory data analysis example - 1C Company
This notebook demonstrates exploratory data analysis of retail sales data, in the context of predicting future sales of items at different retail outlets.
This data was provided by the Russian software publisher and retailer 1C Company, for a Kaggle competition in which the challenge is to predict monthly sales for specific products in specific shops.
First we'll list the provided files.
['items.csv', 'item_categories.csv', 'sales_train.csv', 'sample_submission.csv', 'shops.csv', 'test.csv']
We'll load all the files into the workspace as pandas dataframes. "sales_train" will be renamed to "train" for convenience.
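The loading step can be sketched as follows. The real notebook reads the CSV files listed above; tiny in-memory stand-ins are used here so the snippet is self-contained.

```python
import pandas as pd
from io import StringIO

# Stand-ins for the real CSV files (the notebook reads e.g. "sales_train.csv").
csvs = {
    "train": StringIO("date,date_block_num,shop_id,item_id,item_price,item_cnt_day\n"
                      "02.01.2013,0,59,22154,999.0,1.0\n"),
    "items": StringIO("item_name,item_id,item_category_id\nЯВЛЕНИЕ 2012 (BD),22154,37\n"),
}
# Load every file into a dataframe; "sales_train" becomes "train".
dfs = {name: pd.read_csv(f) for name, f in csvs.items()}
train = dfs["train"]
print(train.shape)
```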
Have a quick look at the train data.
| | date | date_block_num | shop_id | item_id | item_price | item_cnt_day |
|---|---|---|---|---|---|---|
| 0 | 02.01.2013 | 0 | 59 | 22154 | 999.00 | 1.0 |
| 1 | 03.01.2013 | 0 | 25 | 2552 | 899.00 | 1.0 |
| 2 | 05.01.2013 | 0 | 25 | 2552 | 899.00 | -1.0 |
| 3 | 06.01.2013 | 0 | 25 | 2554 | 1709.05 | 1.0 |
| 4 | 15.01.2013 | 0 | 25 | 2555 | 1099.00 | 1.0 |
| | item_name | item_id | item_category_id |
|---|---|---|---|
| 0 | ! ВО ВЛАСТИ НАВАЖДЕНИЯ (ПЛАСТ.) D | 0 | 40 |
| 1 | !ABBYY FineReader 12 Professional Edition Full... | 1 | 76 |
| 2 | ***В ЛУЧАХ СЛАВЫ (UNV) D | 2 | 40 |
| 3 | ***ГОЛУБАЯ ВОЛНА (Univ) D | 3 | 40 |
| 4 | ***КОРОБКА (СТЕКЛО) D | 4 | 40 |
| | item_category_name | item_category_id | supercategory | platform | digital | supercategory_id | platform_id |
|---|---|---|---|---|---|---|---|
| 0 | PC - Headsets / Headphones | 0 | Toys and misc | PC | 0 | 0 | 0 |
| 1 | Accessories - PS2 | 1 | Consoles and accessories | PS2 | 0 | 1 | 1 |
| 2 | Accessories - PS3 | 2 | Consoles and accessories | PS3 | 0 | 1 | 2 |
| 3 | Accessories - PS4 | 3 | Consoles and accessories | PS4 | 0 | 1 | 3 |
| 4 | Accessories - PSP | 4 | Consoles and accessories | PSP | 0 | 1 | 4 |
| | shop_name | shop_id |
|---|---|---|
| 0 | !Якутск Орджоникидзе, 56 фран | 0 |
| 1 | !Якутск ТЦ "Центральный" фран | 1 |
| 2 | Адыгея ТЦ "Мега" | 2 |
| 3 | Балашиха ТРК "Октябрь-Киномир" | 3 |
| 4 | Волжский ТЦ "Волга Молл" | 4 |
It looks like the training data has been structured as normalized tables for efficiency. For convenience we can merge the training data into a single table.
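The merge can be sketched as chained left joins on the key columns. Small stand-in frames are used here in place of the full tables so the example runs on its own.

```python
import pandas as pd

# Tiny stand-in frames mirroring the normalized tables above.
train = pd.DataFrame({"date": ["02.01.2013"], "shop_id": [59], "item_id": [22154],
                      "item_price": [999.0], "item_cnt_day": [1.0]})
items = pd.DataFrame({"item_id": [22154], "item_name": ["ЯВЛЕНИЕ 2012 (BD)"],
                      "item_category_id": [37]})
item_categories = pd.DataFrame({"item_category_id": [37],
                                "item_category_name": ["Cinema - Blu-Ray"]})
shops = pd.DataFrame({"shop_id": [59], "shop_name": ['Ярославль ТЦ "Альтаир"']})

# Left-join the lookup tables onto the sales rows.
train = (train.merge(items, on="item_id", how="left")
              .merge(item_categories, on="item_category_id", how="left")
              .merge(shops, on="shop_id", how="left"))
print(train.columns.tolist())
```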
Check the datatypes of the columns.
date object date_block_num int64 shop_id int64 item_id int64 item_price float64 item_cnt_day float64 item_name object item_category_id int64 item_category_name object supercategory object platform object digital int64 supercategory_id int64 platform_id int64 shop_name object dtype: object
Most fields have an appropriate datatype, although the numeric fields could potentially be downcast to save memory.
The date field is formatted as a string; we can convert it to the datetime dtype to enable extra datetime features, such as grouping by weeks or months.
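For example (the dates are day-first strings like "02.01.2013", so the format is given explicitly):

```python
import pandas as pd

train = pd.DataFrame({"date": ["02.01.2013", "15.01.2013"]})
# Day-first strings, so pass the format explicitly rather than letting
# pandas guess (which would misread them as month-first).
train["date"] = pd.to_datetime(train["date"], format="%d.%m.%Y")
print(train["date"].dt.month.tolist())
```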
Now that we have prepared the dataframe, we can check it for missing values.
date 0 date_block_num 0 shop_id 0 item_id 0 item_price 0 item_cnt_day 0 item_name 0 item_category_id 0 item_category_name 0 supercategory 0 platform 0 digital 0 supercategory_id 0 platform_id 0 shop_name 0 dtype: int64
There aren't any missing values to worry about, so we can have a quick look at the merged dataframe to get an idea of its contents.
| | date | date_block_num | shop_id | item_id | item_price | item_cnt_day | item_name | item_category_id | item_category_name | supercategory | platform | digital | supercategory_id | platform_id | shop_name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013-01-02 | 0 | 59 | 22154 | 999.00 | 1.0 | ЯВЛЕНИЕ 2012 (BD) | 37 | Cinema - Blu-Ray | Cinema | Blu-Ray | 0 | 4 | 12 | Ярославль ТЦ "Альтаир" |
| 1 | 2013-01-03 | 0 | 25 | 2552 | 899.00 | 1.0 | DEEP PURPLE The House Of Blue Light LP | 58 | Music - Vinyl | Music | Vinyl | 0 | 6 | 18 | Москва ТРК "Атриум" |
| 2 | 2013-01-05 | 0 | 25 | 2552 | 899.00 | -1.0 | DEEP PURPLE The House Of Blue Light LP | 58 | Music - Vinyl | Music | Vinyl | 0 | 6 | 18 | Москва ТРК "Атриум" |
| 3 | 2013-01-06 | 0 | 25 | 2554 | 1709.05 | 1.0 | DEEP PURPLE Who Do You Think We Are LP | 58 | Music - Vinyl | Music | Vinyl | 0 | 6 | 18 | Москва ТРК "Атриум" |
| 4 | 2013-01-15 | 0 | 25 | 2555 | 1099.00 | 1.0 | DEEP PURPLE 30 Very Best Of 2CD (Фирм.) | 56 | Music - CD of corporate production | Music | CD | 0 | 6 | 16 | Москва ТРК "Атриум" |
It looks like the rows are individual sales counts for specific shop-item combinations on specific days, with the number of item sales in the column "item_cnt_day". There is a negative value of item_cnt_day in row 2, so it looks like the data also includes entries for returned items.
The pandas "describe" method gives a good summary of the range of values in each numerical column. We'll round values to the nearest integer for clarity and append the number of unique values for each column.
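One way to build this summary, sketched here on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({"shop_id": [0, 0, 25, 59],
                   "item_cnt_day": [1.0, -1.0, 1.0, 2.0]})
# Round the describe() summary, then append a row of unique-value counts.
summary = df.describe().round()
summary.loc["nunique"] = df.nunique()
print(summary)
```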
| | date_block_num | shop_id | item_id | item_price | item_cnt_day | item_category_id |
|---|---|---|---|---|---|---|
| count | 2935849 | 2935849 | 2935849 | 2935849 | 2935849 | 2935849 |
| mean | 14 | 33 | 10197 | 890 | 1 | 40 |
| std | 9 | 16 | 6324 | 1729 | 2 | 17 |
| min | 0 | 0 | 0 | -1 | -22 | 0 |
| 25% | 7 | 22 | 4476 | 249 | 1 | 28 |
| 50% | 14 | 31 | 9343 | 399 | 1 | 40 |
| 75% | 23 | 47 | 15684 | 999 | 1 | 55 |
| max | 33 | 59 | 22169 | 307980 | 2169 | 83 |
| nunique | 34 | 60 | 21807 | 19993 | 198 | 84 |
From this we can see:

- The minimum item_price is negative. Some quick investigation finds that only one otherwise unremarkable entry has a negative price, which should be safe to remove.
- The entries with very high price or sales values are few enough in number that they can be inspected individually and deleted if appropriate. The very highest-valued entries are custom orders for large numbers of items, which are probably best handled by removing entries with values above an appropriate threshold.
- There are 7356 rows (0.25% of the total) with apparently valid negative item_cnt_day values. We delete these because they cause problems when aggregating sales by month, such as months with negative sales totals.
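The cleaning steps amount to a few boolean filters. The price threshold below (100,000) is an illustrative assumption, not the value the notebook necessarily used:

```python
import pandas as pd

train = pd.DataFrame({"item_price": [999.0, -1.0, 307980.0, 899.0],
                      "item_cnt_day": [1.0, 1.0, 1.0, -1.0]})
n_before = len(train)
# Drop the negative price, extreme price outliers (the threshold is a
# judgement call; 100_000 is used here for illustration), and negative
# daily counts (returns).
train = train[(train["item_price"] > 0) &
              (train["item_price"] < 100_000) &
              (train["item_cnt_day"] > 0)]
print(n_before - len(train), "rows removed")
```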
The training data contains multiple shop_ids, not all of which are in the test set. To get an overview we can create a figure which plots total sales for each shop by month and displays the shop names in Russian and English. A table with a list of translated shop names is used for this.
A closer look at these plots reveals several data cleaning issues; in particular, some shops appear under more than one shop_id, and some shops present in the training data do not appear in the test set.
We will merge the duplicate shops and remove all data from shops not in the test month, for simplicity.
(A similar check of item categories finds no data quality issues)
Finally, we'll check for remaining duplicate entries in the training data.
| | date | date_block_num | shop_id | item_id | item_price | item_cnt_day | item_name | item_category_id | item_category_name | supercategory | platform | digital | supercategory_id | platform_id | shop_name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1435367 | 2014-02-23 | 13 | 50 | 3423 | 999.0 | 1.0 | Far Cry 3 (Classics) [Xbox 360, русская версия] | 23 | Games - XBOX 360 | Video games | XBOX 360 | 0 | 3 | 6 | Тюмень ТЦ "Гудвин" |
| 1496766 | 2014-03-23 | 14 | 21 | 3423 | 999.0 | 1.0 | Far Cry 3 (Classics) [Xbox 360, русская версия] | 23 | Games - XBOX 360 | Video games | XBOX 360 | 0 | 3 | 6 | Москва МТРЦ "Афи Молл" |
| 1671873 | 2014-05-01 | 16 | 50 | 3423 | 999.0 | 1.0 | Far Cry 3 (Classics) [Xbox 360, русская версия] | 23 | Games - XBOX 360 | Video games | XBOX 360 | 0 | 3 | 6 | Тюмень ТЦ "Гудвин" |
| 1866340 | 2014-07-12 | 18 | 25 | 3423 | 999.0 | 1.0 | Far Cry 3 (Classics) [Xbox 360, русская версия] | 23 | Games - XBOX 360 | Video games | XBOX 360 | 0 | 3 | 6 | Москва ТРК "Атриум" |
| 2198566 | 2014-12-31 | 23 | 42 | 21619 | 499.0 | 1.0 | ЧЕЛОВЕК ДОЖДЯ (BD) | 37 | Cinema - Blu-Ray | Cinema | Blu-Ray | 0 | 4 | 12 | СПб ТК "Невский Центр" |
There are only 5 duplicate entries, but the fact that 4 of them are for the same product suggests that they are errors, so we might as well drop them.
| date | date_block_num | shop_id | item_id | item_price | item_cnt_day | item_name | item_category_id | item_category_name | supercategory | platform | digital | supercategory_id | platform_id | shop_name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

(0 rows)
The competition challenge was to predict monthly sales totals, so we should create a training set that follows the test format by summing sales for each month.
The test set is a list of all possible combinations of the shops and items that recorded at least one sale in the test month, i.e. the Cartesian product of these shops and items. We recreate this format by summing the sales over the Cartesian product of active shops and sold items in each month.
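A sketch of the monthly aggregation over the per-month Cartesian grid; the column names follow the notebook, while the toy data is illustrative:

```python
import pandas as pd
from itertools import product

# Toy daily sales: month 0, shops {2, 3}, items {10, 11}.
daily = pd.DataFrame({"date_block_num": [0, 0, 0],
                      "shop_id": [2, 2, 3],
                      "item_id": [10, 11, 10],
                      "item_cnt_day": [1.0, 2.0, 1.0]})

# Build the Cartesian grid of active shops and sold items for each month.
frames = []
for block, month in daily.groupby("date_block_num"):
    grid = list(product([block], month["shop_id"].unique(),
                        month["item_id"].unique()))
    frames.append(pd.DataFrame(grid,
                               columns=["date_block_num", "shop_id", "item_id"]))
monthly = pd.concat(frames, ignore_index=True)

# Sum daily counts into monthly totals; combinations with no sales become 0.
sums = (daily.groupby(["date_block_num", "shop_id", "item_id"])["item_cnt_day"]
             .sum().rename("item_cnt_month").reset_index())
monthly = monthly.merge(sums, on=["date_block_num", "shop_id", "item_id"],
                        how="left")
monthly["item_cnt_month"] = monthly["item_cnt_month"].fillna(0.0)
print(monthly)
```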
As before, merge the provided items, categories and shops tables.
Show the head of the table to check it looks ok.
| | date_block_num | shop_id | item_id | item_cnt_month | item_revenue_month | item_name | item_category_id | item_category_name | supercategory | platform | digital | supercategory_id | platform_id | shop_name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 59 | 22154 | 1.0 | 999.0 | ЯВЛЕНИЕ 2012 (BD) | 37 | Cinema - Blu-Ray | Cinema | Blu-Ray | 0 | 4 | 12 | Ярославль ТЦ "Альтаир" |
| 1 | 0 | 59 | 2552 | 0.0 | 0.0 | DEEP PURPLE The House Of Blue Light LP | 58 | Music - Vinyl | Music | Vinyl | 0 | 6 | 18 | Ярославль ТЦ "Альтаир" |
| 2 | 0 | 59 | 2554 | 0.0 | 0.0 | DEEP PURPLE Who Do You Think We Are LP | 58 | Music - Vinyl | Music | Vinyl | 0 | 6 | 18 | Ярославль ТЦ "Альтаир" |
| 3 | 0 | 59 | 2555 | 0.0 | 0.0 | DEEP PURPLE 30 Very Best Of 2CD (Фирм.) | 56 | Music - CD of corporate production | Music | CD | 0 | 6 | 16 | Ярославль ТЦ "Альтаир" |
| 4 | 0 | 59 | 2564 | 0.0 | 0.0 | DEEP PURPLE Perihelion: Live In Concert DVD (К... | 59 | Music - Music video | Music | DVD | 0 | 6 | 13 | Ярославль ТЦ "Альтаир" |
We can look at the distributions of the target and other features, and look for interesting patterns of relationships between variables.
The aim of this exploration should be to find patterns in the data that could help predict the target value, and identify the types of prediction model that are appropriate.
We plot an initial histogram of the target item_cnt_month feature, with a smoothed distribution estimate.
The distribution clearly has a very large peak close to zero. Creating sales counts for every shop-item combination in each month might produce many entries with zero sales, so we should check what proportion of item counts are now zero.
Proportion of 0-valued targets is 0.8457804166978585.
The distribution of non-zero values would be obscured if all values were plotted together, so we plot the distribution of targets with zero values removed.
Proportion of targets greater than 1 is 0.052576717093067826.
Even with zeros removed, the target is very bottom-heavy, with only around 5% of values above 1, although a small number of items sell much more than this.
The skewness of the distribution makes linear models unsuitable for predicting future sales of items, as the assumptions behind linear models (such as normally distributed residuals) will not be met. Non-linear models, such as decision trees or k-nearest neighbor models, would be more suitable.
We can also plot the distribution of item prices. For this, we take the mean price of each item over the months in which it was sold. To do this, we create a table which summarizes monthly data across all shops.
Prices are also highly skewed towards zero. We can display the distribution more clearly by using a log scale.
Here we can see that the price distribution is approximately lognormal with a peak slightly below 10.
Finally, we can also plot the joint distribution of monthly items sales and mean prices, again using a log scale for clarity.
This reveals no strong overall associations between price and sales, although associations may exist in subgroups of the data.
Outliers are also apparent.
Plotting total sales counts per month shows clear downward and seasonal trends. However, the mean number of each item sold per month (which is what is to be predicted) shows a less pronounced downward trend.
Mean sales per item can also be decomposed into seasonal and continuous trends using the statsmodels package. This shows a clear yearly seasonal trend (particularly a peak around the winter holidays) and an overall downward trend that can be assumed to be related to the rise of internet and digital-only sales.
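The notebook uses statsmodels (e.g. its `seasonal_decompose` function) for this. The underlying idea can be sketched with pandas alone: a centred rolling mean estimates the trend, and monthly means of the detrended series estimate the seasonal component. The series below is synthetic, standing in for the real monthly means:

```python
import numpy as np
import pandas as pd

# Synthetic 34-month series with a downward trend plus a December peak.
months = pd.period_range("2013-01", periods=34, freq="M")
t = np.arange(34)
series = pd.Series(100.0 - t + 10 * (months.month == 12), index=months)

# Trend: centred 12-month rolling mean.
trend = series.rolling(window=12, center=True).mean()
# Seasonal component: mean of the detrended values per calendar month.
seasonal = (series - trend).groupby(series.index.month).mean()
print(seasonal.round(1))
```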
The overall sales trend is clearly downwards, but there are differences at the item category level. Compare the following trend and seasonal decomposition plots for games for the PS3 and PS4.
Each item is assigned one of 80+ categories which identify what kind of product it is and what format it is for. Information about these categories could be predictive because different types of item are likely to sell in different amounts.
First we plot mean sales and revenue per item in each category across all shops, for the last year of sales data.
Most obviously, this plot shows that items in one particular category - "Gifts - bags, albums, mousepads" - have a much higher sales volume than items in other categories, but comparatively little revenue. This category contains few items and denotes low-cost products such as promotional bags and mousemats.
Economic and predictive importance may be better represented by plotting summed rather than mean values in each category.
Plotting summed sales shows that movies and games are the highest-selling categories overall, with PS4 games accounting for the most revenue. As well as being predictive in itself, this information could be useful for deciding what categories to prioritize when building predictive models or allocating promotional resources.
Summed sales can also be plotted when data is grouped according to shop.
Some shops, particularly those in Moscow, sell many more items overall than others. All else being equal, an item sold in these shops can be predicted to sell in greater quantities.
Some shops sell more than others, but are there differences in the relative quantities of items from each category that shops sell?
Each shop's profile of summed sales per category can be computed and is plotted here as a heatmap.
The vertical stripes in this heatmap indicate shops that differ from the mean in some way, but high-dimensional data like this can be difficult to understand without some kind of summary.
Principal component analysis (PCA) lets us decompose high-dimensional data into a low-dimensional representation that makes it easier to get an overview of patterns in the data. Although the shop-category sales counts do not have the ideal (normal) distribution for use with PCA, we can still apply it to try to gain some insight into the similarities and differences between shops.
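The decomposition itself can be sketched with plain numpy via the SVD (scikit-learn's PCA is equivalent); the shop-by-category matrix here is random stand-in data:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in shop-by-category sales matrix (6 shops x 4 categories).
X = rng.poisson(20.0, size=(6, 4)).astype(float)

# PCA via SVD of the mean-centred matrix.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U * S                      # shop coordinates on the components
loadings = Vt                       # rows map components back to categories
explained = S**2 / np.sum(S**2)    # proportion of variance per component
print(np.round(explained, 3))
```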
The explained variance plot shows that around 85% of the variation between the shops can be explained by two linear components, and that almost all of the shops lie approximately on a straight line when plotted on these dimensions.
Insight into what these components correspond to can be gained by plotting the loadings that map the component scores back to the original data. We do this here for the two components plotted above, sorting the elements of each component by descending magnitude.
Looking first at component 2, we see that it is most highly weighted on "digital" categories, i.e. non-physical online downloads. The fact that shop 55 is so far from the other shops on this dimension is explained by it being the ID of an online store. This highlights that shop 55 has a very different sales profile to the other shops, and that it may help to handle it differently when making predictions.
The fact that the non-digital shops lie more or less on a straight line in the principal component representation above indicates that they mainly differ in the magnitude of their sales volumes rather than in the types of items which they sell.
The approximate age of an item at the time of a sale can be calculated by subtracting the first date (or month) on which it was sold from the date of the sale.
Total monthly item sales as a function of item age is plotted below for all items, and separately for items in two representative categories.
Plotting total monthly item sales as a function of item age shows that items tend to sell most when they are new, and then decline to a plateau about a year later. The slightly lower sales in the first month compared to the second are attributable to items not always being available for the whole of their first month.
It is also evident that this tendency for items to sell most shortly after their release is more pronounced for some categories, such as movies, than for others.
Even when taking the decline of sales volume over time into account, it seems likely that items that sell well in one month will also sell well in the following month. A column can be created which contains the sales figures from the previous month for the same shop-item combination.
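A sketch of such a lag feature using a groupby shift over toy monthly data:

```python
import pandas as pd

monthly = pd.DataFrame({"date_block_num": [0, 1, 2, 0, 1],
                        "shop_id": [2, 2, 2, 3, 3],
                        "item_id": [10, 10, 10, 10, 10],
                        "item_cnt_month": [3.0, 5.0, 2.0, 1.0, 0.0]})

# Previous month's count for the same shop-item pair. A plain shift assumes
# every month is present for each pair, which holds on the Cartesian grid
# built earlier.
monthly = monthly.sort_values("date_block_num")
monthly["item_cnt_month_lag_1"] = (monthly
    .groupby(["shop_id", "item_id"])["item_cnt_month"].shift(1))
print(monthly)
```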
We can create a regression plot of sales counts as a function of the previous month's sales, for a sample of items which are at least a month old. For clarity, we use log scales on the axes and plot an estimate of the central tendency (mean) of item_cnt_month.
Adding feature item_cnt_month_lag_1
Sales in individual months are mostly low-valued and tend to be noisy. To reduce this noise, windowing methods can be used to calculate a historical mean as a weighted sum of the sales from multiple previous months. Examples using several types of window are shown below.
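The two window types can be sketched on a single shop-item series (a 3-month window is used here for brevity; the notebook's rolling feature uses 12 months):

```python
import pandas as pd

# Monthly counts for one shop-item pair.
sales = pd.Series([3.0, 5.0, 2.0, 0.0, 4.0])

# Shift by one so only past months contribute to the current row, then
# apply a plain rolling mean and an exponentially weighted mean with a
# half-life of one month.
rolling = sales.shift(1).rolling(window=3, min_periods=1).mean()
ewm = sales.shift(1).ewm(halflife=1).mean()
print(pd.DataFrame({"sales": sales, "rolling": rolling, "ewm": ewm}))
```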
Creating feature "shop_id_item_id_item_cnt_month_mean_ewm_hl_1" Creating feature "shop_id_item_id_item_cnt_month_mean_rolling_mean_win_12"
We create windowed 12-month average and exponential moving average (in which recent months are weighted more heavily than earlier months) sales count features, and display regression plots below.
Adding feature item_id_item_cnt_month_mean_lag_1
While the previous month's sales are useful, this information is not available for new items, so alternative information must be used to make predictions for them.
The item category and shop id fields can be used to determine mean item sales counts for items in their first month of sales. This is plotted below for the last year of sales data.
Again, the mean new item sales for item category - shop combinations can also be calculated.
In addition to the information contained in the item categories and shop identities, all items have an associated item_name text feature, which contains a short description of the item that often includes things such as its title, format (e.g. PS4 or PS3) and language.
To aid extraction of information from text, it is often useful to clean it of irrelevant special characters, excess blank spaces and low-information words such as "the", and to convert all text to lowercase. This can be performed with regular expression operations, as demonstrated here:
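A minimal version of such a cleaning function (stop-word removal omitted for brevity):

```python
import re

def clean_name(name: str) -> str:
    # Lowercase, replace non-word characters with spaces, collapse whitespace.
    name = name.lower()
    name = re.sub(r"[^\w\s]", " ", name)
    return re.sub(r"\s+", " ", name).strip()

print(clean_name('3D Action Puzzle "Зомби" Уборщик'))
```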
| | item_name | item_name_clean |
|---|---|---|
| 1000 | 3D Action Puzzle "Зомби" Уборщик | 3d action puzzle зомби уборщик |
| 1001 | 3D Action Puzzle "Зомби" Шахтер | 3d action puzzle зомби шахтер |
| 1002 | 3D Action Puzzle "Техника" Бомбардировщик | 3d action puzzle техника бомбардировщик |
| 1003 | 3D Action Puzzle "Техника" Вертолет | 3d action puzzle техника вертолет |
| 1004 | 3D Action Puzzle "Техника" Гоночная машинка | 3d action puzzle техника гоночная машинка |
One way that the item_name field could be used is by extracting individual words or n-grams (groups of sequential words) and treating them as individual binary categories. Doing this creates a very large number of features, so some kind of filtering will likely be necessary, such as requiring a minimum number of occurrences of a word feature. Another feature selection technique is to filter features according to some metric of relevance to the target variable, such as correlation.
Below is an example of the 1- and 2-grams produced from a single item_name text string.
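The n-gram extraction can be sketched in plain Python (libraries such as scikit-learn's `CountVectorizer` do the same job at scale):

```python
def ngrams(text, n_max=2):
    # Tokenize after stripping punctuation, then emit all 1..n_max-grams.
    tokens = "".join(c if c.isalnum() else " " for c in text.lower()).split()
    grams = set()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.add(" ".join(tokens[i:i + n]))
    return sorted(grams)

print(ngrams("Fuse [Xbox 360, английская версия]"))
```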
9 ngrams found in all items
| | item_name | ng: 360 | ng: 360 английская | ng: fuse | ng: fuse xbox | ng: xbox | ng: xbox 360 | ng: английская | ng: английская версия | ng: версия |
|---|---|---|---|---|---|---|---|---|---|---|
| 3566 | Fuse [Xbox 360, английская версия] | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
Items with different item_ids are often related to each other, such as being different versions of the same video game or movie, and so are likely to have related sales figures. This can be taken advantage of by grouping similar items together based on their item_names.
The Python package TheFuzz implements fuzzy string matching to measure the similarity of sequences of words. The following code uses this, together with an alphabetical sorting of item names, to group related items. The resulting item name group can be used like any other categorical feature.
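The grouping logic can be sketched without TheFuzz by substituting the standard library's difflib for the similarity score; the 0.6 threshold and the three example names are illustrative, and TheFuzz's token-based ratios are more robust in practice:

```python
from difflib import SequenceMatcher

names = sorted(["Fuse [PS3, английская версия]",
                "Fuse [Xbox 360, английская версия]",
                "GABIN The Best Of Gabin 2CD"])

# Walk the alphabetically sorted names; start a new group whenever the
# similarity to the previous name drops below a threshold.
groups, group_id, prev = [], -1, None
for name in names:
    similarity = SequenceMatcher(None, prev, name).ratio() if prev else 0.0
    if similarity < 0.6:
        group_id += 1
    groups.append(group_id)
    prev = name
print(list(zip(names, groups)))
```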
The example list of items after grouping shows that related items (e.g. the same video game for different consoles) are assigned to the same group.
| | item_name | item_name_group |
|---|---|---|
| 3565 | Fuse [PS3, английская версия] | 1362 |
| 3566 | Fuse [Xbox 360, английская версия] | 1362 |
| 3567 | G Data Internet Security 2013 (1ПК / 1 год) (G... | 1363 |
| 3568 | G Data Internet Security 2013 (3ПК / 1 год) (G... | 1363 |
| 3569 | GABIN The Best Of Gabin 2CD | 1364 |
Categorical features such as name groups and shop ids have useful predictive information, but the relationship between category values and the target variable is not consistent over the training data, as items, item categories and shops increase or (more often) decline in popularity over time.
Theoretically, a model could learn the interactions between individual category values and time periods, but for categorical features with high numbers of values, such as item identities and name groups, this could require very complex models and cause problems with overfitting.
To save the predictive model from having to learn the relationships between individual category values and specific time periods, a useful solution is to turn categorical variables into numerical ones by re-encoding each category value as the mean of the target variable over items with that value, computed over some time window before the current month. As with individual items, different time windows can be used.
Here are example windowed mean encodings for 3 values of the item category name feature, using 3 different temporal windows.
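A sketch of such an encoding for a toy category column, using a 2-month trailing window and shifting by one month so that a month's own target never leaks into its encoding:

```python
import pandas as pd

df = pd.DataFrame({"date_block_num": [0, 0, 1, 1, 2, 2],
                   "category": ["a", "b", "a", "b", "a", "b"],
                   "item_cnt_month": [4.0, 1.0, 2.0, 1.0, 0.0, 3.0]})

# Mean target per category per month (categories x months)...
per_month = (df.groupby(["category", "date_block_num"])["item_cnt_month"]
               .mean().unstack())
# ...shift one month to exclude the current month, then take a trailing
# 2-month rolling mean along the month axis.
encoded = per_month.T.shift(1).rolling(window=2, min_periods=1).mean().T
print(encoded)
```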
Creating feature "item_category_name_item_cnt_month_mean_ewm_hl_1" Creating feature "item_category_name_item_cnt_month_mean_rolling_mean_win_12" Creating feature "item_category_name_item_cnt_month_mean_rolling_mean_win_1"
Define a function to fit and return a LightGBM regressor, with or without early stopping.
Split the feature matrix into training and validation sets, with month 33 (the final training month) used as the validation set.
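For example, with the target held out of the feature columns (the toy frame stands in for the full feature matrix):

```python
import pandas as pd

# Toy feature matrix; in the notebook month 33 is the final training month.
fm = pd.DataFrame({"date_block_num": [31, 32, 33, 33],
                   "some_feature": [1.0, 2.0, 3.0, 4.0],
                   "item_cnt_month": [0.0, 1.0, 2.0, 0.0]})

VAL_MONTH = 33
train_mask = fm["date_block_num"] < VAL_MONTH
X_train = fm.loc[train_mask].drop(columns="item_cnt_month")
y_train = fm.loc[train_mask, "item_cnt_month"]
X_valid = fm.loc[~train_mask].drop(columns="item_cnt_month")
y_valid = fm.loc[~train_mask, "item_cnt_month"]
print(len(X_train), len(X_valid))
```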
These hyperparameters were found using the hyperparameter optimization framework Optuna to optimize performance on the validation set.
Fit the booster using early stopping with the validation set
Training until validation scores don't improve for 30 rounds [100] training's rmse: 0.856163 training's l2: 0.733015 valid_1's rmse: 0.80707 valid_1's l2: 0.651363 [200] training's rmse: 0.752276 training's l2: 0.565919 valid_1's rmse: 0.754368 valid_1's l2: 0.569071 [300] training's rmse: 0.711774 training's l2: 0.506623 valid_1's rmse: 0.744477 valid_1's l2: 0.554246 [400] training's rmse: 0.687697 training's l2: 0.472927 valid_1's rmse: 0.742502 valid_1's l2: 0.551309 [500] training's rmse: 0.671048 training's l2: 0.450306 valid_1's rmse: 0.741354 valid_1's l2: 0.549606 Early stopping, best iteration is: [561] training's rmse: 0.662931 training's l2: 0.439477 valid_1's rmse: 0.740871 valid_1's l2: 0.54889
LGBMRegressor(cat_smooth=45.01680827234465, colsample_bytree=0.8,
learning_rate=0.01, max_bin=214, min_child_samples=27,
min_child_weight=0.021144950289224463, min_data_in_bin=7,
n_estimators=8000, num_leaves=966, subsample=0.6,
subsample_for_bin=300000, subsample_freq=5)
Wall time: 1min 6s
.values =
array([-8.19686053e-04, -8.53319023e-04, -9.65044160e-04, 1.15730939e-03,
1.91875521e-03, -1.24212491e-02, -4.65595700e-03, -2.87423533e-03,
1.15643118e-02, -5.64147707e-02, 4.52157653e-03, -8.60860456e-02,
-1.60811582e-02, -2.53377435e-03, 9.05928950e-04, 1.25487368e-02,
7.58401193e-03, 2.22618227e-03, 1.61029907e-03, 2.08944576e-03,
-1.25950408e-02, 8.02309280e-04, -1.39487404e-05, -4.45220132e-02,
4.26772887e-03, -2.99258343e-02, -4.58974255e-02, -4.54617575e-05,
-3.46940177e-06, -1.12505176e-05, -6.63152608e-05, -5.25955020e-05,
3.56156902e-05, -2.86672652e-05, -1.46040717e-05, -1.61875122e-05,
-7.05230762e-05, -2.14635565e-07, -3.16203682e-06, -1.77613791e-05,
5.66886008e-05, -7.81802911e-05, -2.87233743e-05, 1.99339850e-05,
-3.34961256e-07, -1.80680328e-06, -1.10293938e-04, -1.34897610e-07,
-1.97895964e-05, -1.53677586e-04, -5.42197848e-05, -5.55047043e-06,
-3.07481982e-05, -5.25976003e-05, -3.32999314e-05, -1.55060294e-05,
-4.70542271e-06, -1.06448722e-05, -6.91367638e-06, -5.58492298e-06,
-2.93854486e-05, -6.63007828e-06, -3.74690969e-04, -6.40932595e-06,
-2.85863902e-05, -6.50666129e-05, -1.28759809e-05, -4.94752420e-07,
-8.77806511e-06, -9.38369106e-06, 1.73708768e-04, -3.82167598e-04,
-1.19688245e-04, -1.41374098e-04, -1.81064610e-04, -1.46671203e-04,
-1.86676758e-05])
.base_values =
0.3103335467195565
.data =
array([3.3000000e+01, 4.3000000e+01, 1.0010000e+03, 7.6600000e+02,
1.0020000e+03, 1.9900000e+02, 1.0000000e+01, 2.7990000e+03,
5.8000000e+01, 7.3486328e-02, 5.8935547e-01, 7.8125000e-03,
8.3312988e-02, 2.7441406e-01, 2.3791504e-01, 4.4403076e-02,
4.5227051e-02, 1.7565918e-01, 3.6865234e-01, 2.6186111e+02,
0.0000000e+00, 0.0000000e+00, 1.8000000e+01, 2.4390243e-02,
3.8085300e-02, 8.2524270e-02, 0.0000000e+00, 0.0000000e+00,
0.0000000e+00, 0.0000000e+00, 0.0000000e+00, 0.0000000e+00,
0.0000000e+00, 0.0000000e+00, 0.0000000e+00, 0.0000000e+00,
0.0000000e+00, 0.0000000e+00, 0.0000000e+00, 0.0000000e+00,
0.0000000e+00, 0.0000000e+00, 0.0000000e+00, 0.0000000e+00,
0.0000000e+00, 0.0000000e+00, 0.0000000e+00, 0.0000000e+00,
0.0000000e+00, 0.0000000e+00, 0.0000000e+00, 0.0000000e+00,
0.0000000e+00, 0.0000000e+00, 0.0000000e+00, 0.0000000e+00,
0.0000000e+00, 0.0000000e+00, 0.0000000e+00, 0.0000000e+00,
0.0000000e+00, 0.0000000e+00, 0.0000000e+00, 0.0000000e+00,
0.0000000e+00, 0.0000000e+00, 0.0000000e+00, 0.0000000e+00,
0.0000000e+00, 0.0000000e+00, 0.0000000e+00, 0.0000000e+00,
0.0000000e+00, 0.0000000e+00, 0.0000000e+00, 0.0000000e+00,
0.0000000e+00], dtype=float32)
array([[ 33., 43., 1001., ..., 0., 0., 0.],
[ 33., 47., 157., ..., 0., 0., 0.],
[ 33., 90., 9999., ..., 0., 0., 0.],
...,
[ 33., 33., 929., ..., 0., 0., 0.],
[ 33., 43., 970., ..., 0., 0., 0.],
[ 33., 50., 21., ..., 0., 0., 0.]], dtype=float32)
| | date_block_num | item_name_length | first_item_sale_days | first_shop_item_sale_days | first_name_group_sale_days | last_shop_item_sale_days | month | last_item_price | item_category_id | shop_id_item_category_id_new_item_item_cnt_month_mean_rolling_mean_win_12 | ... | word_контроллер | word_оплаты | word_охота | word_регион | word_русская | word_русские | word_специальное | word_субтитры | word_цифровая | word_черный |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 544128 | 2 | 28 | 56 | 12 | 59 | 12 | 3 | 792.474976 | 69 | 0.099976 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 544129 | 2 | 29 | 58 | 9999 | 58 | 9999 | 3 | 138.571426 | 40 | 0.097656 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 544130 | 2 | 31 | 17 | 13 | 17 | 5 | 3 | 673.148376 | 37 | 0.066406 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 544131 | 2 | 48 | 58 | 3 | 59 | 1 | 3 | 921.662659 | 19 | 0.307617 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 544132 | 2 | 47 | 58 | 9999 | 59 | 9999 | 3 | 345.366211 | 30 | 0.570801 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 8421038 | 32 | 26 | 394 | 9999 | 398 | 9999 | 9 | 1249.000000 | 61 | 0.040710 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 8421039 | 32 | 29 | 386 | 25 | 398 | 25 | 9 | 1165.666626 | 61 | 0.040710 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 8421040 | 32 | 11 | 880 | 637 | 970 | 637 | 9 | 199.000000 | 37 | 0.121094 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 8421041 | 32 | 29 | 317 | 9999 | 398 | 9999 | 9 | 1249.000000 | 61 | 0.040710 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 8421042 | 32 | 29 | 11 | 9999 | 308 | 9999 | 9 | 1799.000000 | 61 | 0.040710 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
7876915 rows × 77 columns